How it’s Made: Interacting with Gemini through multimodal prompting

2024-04-12 16:36

Clue: This country is known for its unique wildlife, including kangaroos and koalas. (Answer: Australia)

Ok, that’s a good clue. Let’s test out whether pointing will work. Just for fun, let’s try pointing at the wrong place first:

Guess:

A person pointing at Brazil on a map of the world with their right index finger. The map includes blue oceans, and green continents with no country borders.

I am pointing at the country of Brazil *incorrect*

Great! Gemini looked at my image and figured out I’m pointing at Brazil, and correctly reasoned that’s wrong. Now let’s point at the right place on the map:

I am pointing at the country of Australia *correct*

Nice! We’ve basically taught Gemini our game logic just by giving it an example. You'll also notice that it generalized from the illustrated hand in the examples.
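To make the pattern concrete, here is a minimal sketch of how such a one-example prompt could be laid out as plain data. It does not call any Gemini API: the `text`/`image` part shape, the `buildGuessPrompt` helper, the file names, and the Italy example turn are all hypothetical, chosen only to show the ordering of a worked example followed by a new turn.

```javascript
// Hypothetical part constructors; not taken from any Gemini SDK.
const text = (value) => ({ type: "text", value });
const image = (path) => ({ type: "image", path });

// Lay out one worked example turn (clue, guess image, verdict), then the
// new turn. The new turn has no verdict: the prompt ends where the model
// is expected to continue.
function buildGuessPrompt(exampleTurn, newTurn) {
  return [
    text(`Clue: ${exampleTurn.clue}`),
    text("Guess:"),
    image(exampleTurn.guessImage),
    text(exampleTurn.verdict),
    text(`Clue: ${newTurn.clue}`),
    text("Guess:"),
    image(newTurn.guessImage),
  ];
}

const prompt = buildGuessPrompt(
  {
    // Made-up example turn, for illustration only.
    clue: "This country is shaped like a boot. (Answer: Italy)",
    guessImage: "pointing_at_italy.png",
    verdict: "I am pointing at the country of Italy *correct*",
  },
  {
    clue: "This country is known for its unique wildlife, including kangaroos and koalas. (Answer: Australia)",
    guessImage: "pointing_at_brazil.png",
  }
);
```

The design choice worth noting is that the example turn carries the whole game logic: the model sees what a verdict looks like once and is left to produce the next one itself.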

⌨️ Coding

Of course, to bring your game idea to life, you’ll eventually have to write some executable code. Let’s see if Gemini can make a simple countdown timer for a game, but with a few fun twists:

Implement a simple timer in HTML/CSS/Javascript. Use a sans serif font and dark mode. Start it at 10 seconds and start counting down. When it reaches zero, replace the timer with a random emoji that is associated with excitement and motivation! Then go back to the timer at 10 seconds and start counting down again.

With just this single instruction, Gemini gives us a working timer that does what we asked for:

An animated GIF of a countdown timer starting from 10. At the end of the countdown, a rocket emoji is shown, followed by a lightning bolt emoji and a confetti emoji.

My favorite part is scrolling through Gemini’s source code to find the array of motivational emojis it picked for me:

const emojis = ['🚀', '⚡️', '🎉', '🎊', '🥳', '🤩', '✨'];
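The rest of Gemini's generated source is not reproduced in this post. As a rough sketch of the logic the prompt asks for (not Gemini's actual output), the countdown can be written as a small state step plus optional browser wiring. The names `nextState` and `render`, and the `timer` element id, are assumptions made for this example.

```javascript
const emojis = ["🚀", "⚡️", "🎉", "🎊", "🥳", "🤩", "✨"];
const START_SECONDS = 10;

// One tick of the timer. Counts down from START_SECONDS; on the tick that
// would reach zero it shows a random emoji instead, and on the tick after
// that it resets to START_SECONDS. `random` is injectable for testing.
function nextState(state, random = Math.random) {
  if (state.emoji !== null) {
    return { seconds: START_SECONDS, emoji: null };
  }
  if (state.seconds > 1) {
    return { seconds: state.seconds - 1, emoji: null };
  }
  const pick = emojis[Math.floor(random() * emojis.length)];
  return { seconds: 0, emoji: pick };
}

// What the page should display for a given state.
function render(state) {
  return state.emoji !== null ? state.emoji : String(state.seconds);
}

// Browser wiring (skipped when there is no DOM, e.g. under Node).
// Assumes the page contains an element with id "timer".
if (typeof document !== "undefined") {
  const el = document.getElementById("timer");
  if (el) {
    let state = { seconds: START_SECONDS, emoji: null };
    el.textContent = render(state);
    setInterval(() => {
      state = nextState(state);
      el.textContent = render(state);
    }, 1000);
  }
}
```

Keeping the step function free of DOM calls makes the countdown and reset behavior easy to check without a browser; the dark mode and sans serif font from the prompt would live in CSS and are left out here.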

👀 A sneak peek

Throughout this post, we’ve been giving Gemini an input, and having Gemini make predictions for what might come next. That's basically what prompting is. And our inputs have been multimodal – image and text, combined.

But so far we've only shown Gemini responding in text. Maybe you’re wondering, can Gemini also respond with a combination of image and text? It can! This is a capability of Gemini called “interleaved text and image generation.” While this feature won’t be ready in the first version of Gemini for people to try, we hope to roll it out soon. Here’s a sneak peek of what’s possible.

Let’s see if we could use Gemini to provide everyday creative inspiration. And let’s try it in a domain that requires a bit of multimodal reasoning ... knitting! 🧶 Similar to our map game above, let's provide one example turn of interaction:

Give me three ideas for crochet creations I could make with these two colors: Balls of blue and yellow yarn on a wooden table.

Colors: I see blue and yellow yarn. Idea 1: How about a cute dog with one yellow ear? Image 1: A blue and yellow crocheted dog head with ears placed on a wooden table.


Idea 2: Or how about a jellyfish with a few yellow tentacles? Image 2: A blue and yellow crocheted octopus with tentacles placed on a wooden table.

Idea 3: Or how about a blue cat with a yellow nose? Image 3: A blue and yellow crocheted cat with whiskers placed on a wooden table.

We’re essentially teaching Gemini about how we want each interaction to go: “I’ll take a photo of two balls of yarn, and I expect you (Gemini) to both come up with an idea for something I could make, and generate an image of it.”

Now, let’s show it a new pair of yarn colors it hasn't yet seen, and see if it can generalize:

Give me three ideas for crochet creations I could make with these two colors: Balls of blue and pink yarn on a wooden table.

Gemini: Colors: I see blue and pink yarn. Idea 1: How about a pig with blue ears? Image 1: A blue and pink crocheted pig face with ears placed on a wooden table.


Idea 2: Or an octopus with pink and blue tentacles? Image 2: A blue and pink crocheted octopus with tentacles placed on a wooden table.

Idea 3: Or a bunny with a pink nose? Image 3: A blue and pink crocheted bunny placed on a wooden table.

Nice! Gemini correctly reasoned about the new colors (“I see blue and pink yarn”) and generated these ideas and the images in a single, interleaved output of text and image.
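This post does not say how interleaved output would be delivered to a program. Purely as a sketch, assume it arrives as an ordered list of text and image parts; the part shape, the file names, and the `pairIdeasWithImages` helper below are hypothetical. Pairing each idea with the image that follows it could then look like this:

```javascript
// Hypothetical: walk the parts in order and pair each "Idea N:" text part
// with the next image part that follows it. Other text parts (such as the
// "Colors:" line) are skipped.
function pairIdeasWithImages(parts) {
  const pairs = [];
  let pendingIdea = null;
  for (const part of parts) {
    if (part.type === "text" && /^Idea \d+:/.test(part.value)) {
      pendingIdea = part.value;
    } else if (part.type === "image" && pendingIdea !== null) {
      pairs.push({ idea: pendingIdea, image: part });
      pendingIdea = null;
    }
  }
  return pairs;
}

// Sample data mirroring the blue-and-pink response above; paths are made up.
const sampleOutput = [
  { type: "text", value: "Colors: I see blue and pink yarn." },
  { type: "text", value: "Idea 1: How about a pig with blue ears?" },
  { type: "image", path: "pig.png" },
  { type: "text", value: "Idea 2: Or an octopus with pink and blue tentacles?" },
  { type: "image", path: "octopus.png" },
  { type: "text", value: "Idea 3: Or a bunny with a pink nose?" },
  { type: "image", path: "bunny.png" },
];

const pairs = pairIdeasWithImages(sampleOutput);
```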

What Gemini did here is fundamentally different from today’s text-to-image models. It's not just passing an instruction to a separate text-to-image model. It sees the image of my actual yarn on my wooden table, truly doing multimodal reasoning about my text and image together.

What's Next?

We hope you found this a helpful starter guide to get a sense of what’s possible with Gemini. We’re very excited to roll it out to more people soon so you can explore your own ideas through prompting. Stay tuned!
